Abstract. The random chemistry algorithm of Kauffman can be used to determine an
unknown subset $S$ of a fixed set $V$. The algorithm proceeds by zeroing in on $S$ through a
succession of nested subsets $V_0 = V \supset V_1 \supset \cdots \supset S$. In Kauffman's original algorithm, the
size of each $V_i$ is chosen to be half the size of $V_{i-1}$. In this paper we determine the optimal
sequence of sizes so as to minimize the expected run time of the algorithm.

1. Introduction

Consider the following set-guessing game between a responder and a questioner. The two
players first agree on integers $n > k > 0$. The game begins with the responder secretly
choosing a subset $S \subset [n] = \{1, 2, \ldots, n\}$ of cardinality $k$. The questioner's task is to
determine $S$. During each turn, the questioner proposes a set $V$; the responder then indicates
whether or not $V$ contains $S$. The game ends when the questioner proposes $V = S$.

In [1], Kauffman proposes a Random Chemistry (RC) algorithm for determining the set
$S$ when $k \ll n$. In following this algorithm, the questioner zeroes in on $S$ by choosing
successively smaller sets. More precisely, she creates a sequence of sets
$[n] = V_0 \supset V_1 \supset V_2 \supset \cdots \supset V_m = S$ as follows. Given a set $V_{i-1}$ known to strictly contain $S$, the questioner
proposes random subsets of $V_{i-1}$ of cardinality $|V_{i-1}|/2$ until she finds one containing $S$. This
set is then taken to be $V_i$.

By allowing the ratios $|V_i|/|V_{i-1}|$ to deviate from a ratio of 1/2, we obtain a more general
class of algorithms. In Theorem 2 we determine the sequence of ratios that minimizes the
expected number of turns in the game.

The above set-guessing problem has appeared in disparate applied contexts. In fact, Kauffman
developed his RC algorithm as a way to hypothetically search for auto-catalytic sets of
molecules. Eppstein et al. [2] later implemented an RC algorithm as a way of searching for
small sets of non-linearly interacting genetic variations in genome-wide association studies.
More recently, Eppstein and Hines [3] applied a slight variation of this algorithm to search
for collections of multiple outages leading to cascading power failures in models of electrical
distribution grids.

A number of closely related problems have also been widely studied. Most notably, the 
field of group testing (see, for example, [1] for an overview) is also concerned with finding 
unknown sets. In group testing, a positive test result occurs when the pooled group V has 
a nonempty intersection with the unknown set S. In the problem we discuss, a positive test 
result occurs only when V contains all of S. 

Another class of related problems is that of searching games in which the responder can
lie. The Rényi-Ulam game [6, 7] is probably the most famous of these; see [5] for a survey
of such games.

In Section 2 we state precisely the optimization problem we are trying to solve. In Section 3
we solve the continuous analog of the problem, while Section 4 presents numeric data
regarding how well the continuous solution mirrors the discrete one. Section 5 presents
an approximate solution to the problem of Section 2 as an application of the calculus of
variations.



2. The Optimization Problem 

Assume $k$, $n$ and $S$ are as given in the Introduction; set $n_0 = n$ and $V_0 = [n]$. If we
choose a proper subset of $V_0$ of size $n_1$ such that $n_0 > n_1 \ge k$, then

$$p_1 = \frac{\binom{n_0-k}{n_1-k}}{\binom{n_0}{n_1}} \tag{1}$$

is the probability of obtaining a subset containing $S$. The expected number of times we
would have to select a subset of size $n_1$ until we find one containing $S$ is therefore $1/p_1$.

Now consider a sequence $n_1, n_2, \ldots, n_M$ where $n_0 > n_1 > n_2 > \cdots > n_M = k$. Our
Generalized Random Chemistry (GRC) algorithm begins by selecting sets of size $n_1$ until
we find one, $V_1$, such that $V_1 \supset S$. Such a $V_1$ must exist since, by hypothesis, $V_0 \supset S$ and
$n_1 \ge k$. We then select subsets of size $n_2$ from $V_1$ until we find a set $V_2$ containing $S$. The
process continues until we have chosen $V_M$. Define $p_i$ as the probability of selecting a set $V_i$
of size $n_i$ containing $S$ from a set $V_{i-1}$ of size $n_{i-1}$ known to contain $S$. Then, as in (1),

$$p_i = \frac{\binom{n_{i-1}-k}{n_i-k}}{\binom{n_{i-1}}{n_i}}. \tag{2}$$
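The GRC procedure described above can be sketched as a small simulation. This is a minimal illustration rather than the authors' implementation; the function name `simulate_grc` and its argument names are assumptions introduced here, and elements are labelled 0 to n0 - 1 for convenience.

```python
import random

def simulate_grc(n0, k, sizes, rng=random):
    """Play one game and return the number of proposals made.

    sizes is the sequence n_1 > n_2 > ... > n_M = k with n0 > n_1
    (hypothetical argument name, not taken from the paper).
    """
    secret = set(rng.sample(range(n0), k))   # responder's hidden set S
    current = list(range(n0))                # V_0 = [n]
    proposals = 0
    for size in sizes:
        while True:
            candidate = rng.sample(current, size)  # uniform subset of V_{i-1}
            proposals += 1
            if secret.issubset(candidate):         # responder answers "yes"
                current = candidate                # this set becomes V_i
                break
    return proposals

# Toy run: n0 = 6, k = 2, sizes 4 then 2, averaged over many games.
rng = random.Random(0)
runs = [simulate_grc(6, 2, [4, 2], rng) for _ in range(20000)]
mean_steps = sum(runs) / len(runs)
```

For this toy sequence, formula (2) gives $1/p_1 + 1/p_2 = 15/6 + 6 = 8.5$, so the simulated mean should land close to that value.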



Let the random variable $X_i$ represent the number of selections needed to find a set $V_i$
containing $S$ as described above. Then the expected value of $X_i$ is $1/p_i$ ($X_i$ is a geometric
random variable). Let $X = X_1 + \cdots + X_M$ denote the random variable representing the total
number of selections until we find $S$. This presents the following

Problem 1. How should one choose $M \le n_0 - k$ and $n_1, n_2, \ldots, n_M$ subject to
$n_0 > n_1 > n_2 > \cdots > n_M = k$ so as to minimize

$$E[X] = \sum_{i=1}^{M} \frac{1}{p_i}\,? \tag{3}$$
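For integer sequences, the quantity to be minimized, $E[X] = \sum_i 1/p_i$ with $p_i$ as in (2), can be evaluated exactly. The sketch below does this with rational arithmetic. The helper names are assumptions introduced for illustration, and `halving_sizes` is only one possible reading of Kauffman's halving rule (floor of half the size, clamped at $k$), since the text does not say how odd sizes are handled.

```python
from fractions import Fraction
from math import comb

def success_probability(n_prev, n_next, k):
    """p_i from equation (2) for integer sizes."""
    return Fraction(comb(n_prev - k, n_next - k), comb(n_prev, n_next))

def expected_steps(n0, k, sizes):
    """E[X] = sum of 1/p_i along the sequence n0 > sizes[0] > ... > sizes[-1] = k."""
    total = Fraction(0)
    n_prev = n0
    for n_next in sizes:
        total += 1 / success_probability(n_prev, n_next, k)
        n_prev = n_next
    return total

def halving_sizes(n0, k):
    """Assumed rendering of the halving rule: floor(n / 2), never below k."""
    sizes, n = [], n0
    while n > k:
        n = max(k, n // 2)
        sizes.append(n)
    return sizes

# Compare a single jump straight to size k with the halving sequence.
one_jump = expected_steps(100, 5, [5])
halving = expected_steps(100, 5, halving_sizes(100, 5))
```

Jumping directly to size $k$ costs $\binom{n_0}{k}$ expected proposals, which is why intermediate sizes are worth choosing carefully.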

3. The Continuous Solution 

In the combinatorial formulation leading to Problem 1 the $n_i$ are all integers. In this
section we relax this requirement and provide an optimal solution when the $n_i$ are only
required to be real. Accordingly, we replace the factorial functions with $\Gamma$-functions; recall
that $\Gamma(n) = (n-1)!$ for $n$ a positive integer.

Theorem 2. The expected number of steps in the GRC algorithm

$$E[X] = \sum_{i=1}^{M} \frac{1}{p_i} = \sum_{i=1}^{M} \frac{\Gamma(n_{i-1}+1)\,\Gamma(n_i-k+1)}{\Gamma(n_i+1)\,\Gamma(n_{i-1}-k+1)}$$

is minimized over the real numbers for $p_i = \binom{n_0}{k}^{-1/M}$. The optimal value of $M$ is $\ln\binom{n_0}{k}$, in
which case the expression for the $p_i$ reduces to $p_i = 1/e$ and $E[X] = e \ln\binom{n_0}{k}$.



Proof. Let $z_i = 1/p_i$ and note that $\prod_{i=1}^{M} z_i = \binom{n_0}{k}$. Then the problem reduces to finding
$M$ and $z_1, \ldots, z_M$ that minimize $\sum_{i=1}^{M} z_i$ subject to $\sum_{i=1}^{M} \ln(z_i) = C$, where $C = \ln\binom{n_0}{k}$ and
$z_i \ge 1$ for $1 \le i \le M$.

The method of Lagrange multipliers instructs us to minimize

$$F(z_1, z_2, \ldots, z_M, \lambda) = \sum_{i=1}^{M} z_i - \lambda\left(\sum_{i=1}^{M} \ln(z_i) - C\right) \tag{4}$$

where $\lambda$ is the Lagrange multiplier. Differentiating with respect to $z_1, \ldots, z_M$ and $\lambda$, setting
the derivatives equal to zero and solving gives the solution $z_i = \hat z_i = e^{C/M}$. To find
the optimal value for $M$, note that $\sum_{i=1}^{M} \hat z_i = M e^{C/M}$. This function is minimized when
$M = \hat M = C = \ln\binom{n_0}{k}$. The optimal values for the probabilities are then $\hat p_i = e^{-C/\hat M} = 1/e$
and the expected number of steps using the $\hat p_i$ is then $E[X] = \sum_{i=1}^{\hat M} 1/\hat p_i = e \ln\binom{n_0}{k}$. □
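
The algebra that the proof leaves to the reader can be spelled out as follows. This is a sketch using only the quantities defined above; the first line is the telescoping product behind the constraint, and the remaining lines are the stationarity conditions.

```latex
\begin{align*}
\prod_{i=1}^{M} z_i
  &= \prod_{i=1}^{M} \frac{\Gamma(n_{i-1}+1)\,\Gamma(n_i-k+1)}{\Gamma(n_i+1)\,\Gamma(n_{i-1}-k+1)}
   = \frac{\Gamma(n_0+1)\,\Gamma(n_M-k+1)}{\Gamma(n_M+1)\,\Gamma(n_0-k+1)}
   = \binom{n_0}{k} \quad (n_M = k), \\
\frac{\partial F}{\partial z_i} &= 1 - \frac{\lambda}{z_i} = 0
  \;\Longrightarrow\; z_i = \lambda, \\
\frac{\partial F}{\partial \lambda} &= -\Bigl(\sum_{i=1}^{M} \ln z_i - C\Bigr) = 0
  \;\Longrightarrow\; M \ln \lambda = C
  \;\Longrightarrow\; \lambda = e^{C/M}, \\
\frac{d}{dM}\bigl(M e^{C/M}\bigr) &= e^{C/M}\Bigl(1 - \frac{C}{M}\Bigr) = 0
  \;\Longrightarrow\; M = C .
\end{align*}
```

Since the second derivative of $M e^{C/M}$ with respect to $M$ is $C^2 e^{C/M}/M^3 > 0$, the stationary point $M = C$ is indeed a minimum.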

While Theorem 2 gives closed forms for the optimal values of $M$ and the $p_i$, we do not
obtain a simple expression for the $n_i$. For fixed $M$, the optimal sequence $n_1, n_2, \ldots, n_{M-1}$
can be obtained by successively solving the equations

$$\frac{\binom{n_{i-1}}{n_i}}{\binom{n_{i-1}-k}{n_i-k}} = \binom{n_0}{k}^{1/M}, \qquad i = 1, 2, \ldots, M-1. \tag{5}$$

The equations are easily solved using a univariate root finding algorithm.
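
One way to carry out this root finding is plain bisection on the logarithm of $1/p_i$, with factorials replaced by $\Gamma$-functions via `math.lgamma`. The sketch below is an assumption about how such a solver might look, not the method used by the authors; the names `log_inv_p` and `exact_sizes` are hypothetical.

```python
from math import lgamma

def log_inv_p(n_prev, n_next, k):
    """ln(1/p_i) with factorials replaced by Gamma functions."""
    return (lgamma(n_prev + 1) + lgamma(n_next - k + 1)
            - lgamma(n_next + 1) - lgamma(n_prev - k + 1))

def exact_sizes(n0, k, M, iterations=200):
    """Real-valued n_1, ..., n_M giving every step the same 1/p_i."""
    target = log_inv_p(n0, k, k) / M          # ln C(n0, k) / M
    sizes, n_prev = [], float(n0)
    for _ in range(M - 1):
        # On [k, n_prev], ln(1/p) decreases from ln C(n_prev, k) down to 0.
        lo, hi = float(k), n_prev
        for _ in range(iterations):
            mid = (lo + hi) / 2
            if log_inv_p(n_prev, mid, k) > target:
                lo = mid                      # step too costly: raise n_i
            else:
                hi = mid
        n_prev = (lo + hi) / 2
        sizes.append(n_prev)
    sizes.append(float(k))                    # n_M = k by definition
    return sizes

sizes = exact_sizes(100, 5, 18)
```

Because the product of the $1/p_i$ telescopes to $\binom{n_0}{k}$, the final step from $n_{M-1}$ to $k$ automatically has the same success probability as the others, which gives a convenient consistency check.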

Alternatively, we can modify equation (5) by using the approximation $\binom{a}{b} \approx (a/b)^b$. (This
approximation works best for $b \ll a$.) In doing so, we find that the optimal values of the $n_i$
are

$$n_i \approx k^{i/M} n_0^{1-i/M} \tag{6}$$

and that the optimal value of $M$ is approximately $k \ln(n_0/k)$. As shown in Section 5, this
approximate solution can be obtained directly by applying the binomial coefficient
approximation to equation (3) and then applying the calculus of variations.
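
The closed form (6) is a geometric interpolation between $n_0$ and $k$, which a few lines of code make concrete. The function name `approx_sizes` is an assumption introduced here; the comparison of the two values of $M$ uses $n_0 = 100$ and $k = 5$ purely as sample inputs.

```python
from math import comb, log

def approx_sizes(n0, k, M):
    """Real-valued n_0, n_1, ..., n_M from the approximation (6)."""
    return [k ** (i / M) * n0 ** (1 - i / M) for i in range(M + 1)]

n0, k = 100, 5
m_exact = log(comb(n0, k))      # ln C(100, 5), about 18.14
m_approx = k * log(n0 / k)      # k ln(n0 / k), about 14.98

sizes = approx_sizes(n0, k, 18)
ratios = [b / a for a, b in zip(sizes, sizes[1:])]  # constant, equal to (k/n0)^(1/M)
```

The gap between the two values of $M$ reflects the remark that the binomial approximation works best when $b \ll a$.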

From Theorem 2, the optimal solution has the property that the $p_i$ are constant. The
following corollary follows from the well-known fact that a sum of independent, identically
distributed geometric random variables has the negative binomial distribution.

Corollary 3. Fix $n_0$, $k$ and $M$. Then the sequence $n_1, n_2, \ldots, n_M$ minimizing $E[X]$
induces the distributional property that $X \sim \text{Negative Binomial}(M, p)$ where $p = \binom{n_0}{k}^{-1/M}$.
That is, $X$ has probability mass function $P(X = x) = \binom{x-1}{M-1}(1-p)^{x-M} p^M$ for
$x = M, M+1, \ldots$.

The utility of Corollary 3 is that it provides a straightforward means for computing the
probability that more than $l$ steps would be required to find the solution using the optimal
sequence.
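
As an illustration of this use of Corollary 3, the tail probability can be computed by summing the probability mass function directly. The function name `prob_more_than` is an assumption, and the sample values of $M$, $p$ and $l$ below are chosen only for demonstration.

```python
from math import comb

def prob_more_than(l, M, p):
    """P(X > l) for X ~ Negative Binomial(M, p) supported on M, M+1, ..."""
    cdf = sum(comb(x - 1, M - 1) * (1 - p) ** (x - M) * p ** M
              for x in range(M, l + 1))
    return 1.0 - cdf

# Sample use with n0 = 100, k = 5, M = 18, so that p = C(100, 5)^(-1/18).
p = comb(100, 5) ** (-1 / 18)
tail = prob_more_than(75, 18, p)
```

For $l < M$ the sum is empty and the function returns 1, consistent with the fact that at least $M$ selections are always needed.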

Figure 1.



4. Numerical data 

Equation (6) gives a closed form approximate solution to the continuous problem motivated
by Problem 1. To utilize this solution in actual problems, we need to map the $n_i$ to integers.
A simple method is to provisionally set each $n_i$ according to (6). Then, for
$i = M-1, M-2, \ldots, 2, 1$ set $n_i = \max(n_{i+1}+1, [n_i])$. In this section we will explore several examples
in order to show that the integer-valued approximate solutions compare favorably with the
continuous minimum of $E[X]$.
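
The integer mapping just described can be sketched as below. The rounding rule applied to the provisional real values is not recoverable from the text, so the sketch assumes the floor and exposes it as a parameter; the names `integer_sizes` and `to_int` are hypothetical.

```python
from math import floor

def integer_sizes(n0, k, M, to_int=floor):
    """Integer sequence n_0 > n_1 > ... > n_M = k built from formula (6).

    to_int is the rounding rule (ASSUMED to be floor; pass round to compare).
    """
    real = [k ** (i / M) * n0 ** (1 - i / M) for i in range(M + 1)]
    sizes = [0] * (M + 1)
    sizes[0], sizes[M] = n0, k
    # Work backwards so that each size stays strictly above its successor.
    for i in range(M - 1, 0, -1):
        sizes[i] = max(sizes[i + 1] + 1, to_int(real[i]))
    return sizes

sizes = integer_sizes(100, 5, 18)
```

The backward pass only pushes sizes upward, so for $M$ close to $n_0 - k$ the first entries could reach $n_0$; the sketch does not guard against that case.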

Example 4. Let $n_0 = 100$ and $k = 5$. The optimal number of partitions given by Theorem 2
is $M_{\mathrm{opt}} = \ln\binom{n_0}{k} \approx 18$. Using this optimal value for $M$, we computed values for $n_1, \ldots, n_{M-1}$
using equations (5) and (6). Figure 1 illustrates the very similar sequences that result. The
expected numbers of steps are 49.3 for the exact solution and 50.5 for the approximate
solution.
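
The figures in this example can be checked numerically. The sketch below evaluates $M_{\mathrm{opt}}$ and the bound $e \ln\binom{n_0}{k}$ from Theorem 2, and then the real-valued sum of $1/p_i$ along the sequence (6). How the reported value for the approximate solution was computed is not stated, so the last quantity is an assumption about that evaluation, and `real_expected_steps` is a hypothetical helper.

```python
from math import comb, e, exp, lgamma, log

n0, k = 100, 5
C = log(comb(n0, k))            # ln C(100, 5)
M = round(C)                    # 18
bound = e * C                   # Theorem 2 minimum, about 49.3

def real_expected_steps(sizes, k):
    """Sum of 1/p_i along a real-valued sequence, using Gamma functions."""
    total = 0.0
    for a, b in zip(sizes, sizes[1:]):
        total += exp(lgamma(a + 1) + lgamma(b - k + 1)
                     - lgamma(b + 1) - lgamma(a - k + 1))
    return total

approx = [k ** (i / M) * n0 ** (1 - i / M) for i in range(M + 1)]
approx_steps = real_expected_steps(approx, k)
```

By Theorem 2, `approx_steps` cannot fall below `bound`, so the difference between the two measures the cost of using the closed form (6) in place of the exact sequence.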



We also simulated 100,000 runs of the GRC algorithm for the same values of $n_0 = 100$
and $k = 5$. In each experiment, we generated a set $S$ and counted the number of steps to
find it using the optimal sequence on the integer scale. The mean number of steps was 50.9.
Figure 2 shows the empirical distribution for the number of steps with the negative binomial
distribution overlaid.

Figure 2. Comparison of Negative Binomial distribution to empirical distribution
for number of steps based on 100,000 simulations where $n_0 = 100$ and
$k = 5$.



Finally, in Figure 3 we compare the optimal solution of Theorem 2 to that of Kauffman's
original RC algorithm. Three different $n_i$-sequences are illustrated for $n_0 = 100$, $k = 5$: the
exact solution corresponding to the solution of equation (5) (circles), the approximate solution
of (6) (squares), and the sequence generated by Kauffman's RC algorithm (triangles). The
paired increasing plots illustrate the cumulative expected number of guesses required to find
a given set $V_i$.



5. Calculus of Variations 

In this section we present an alternate derivation of the approximately optimal formula
for the $n_i$ given in (6). In passing from the discrete to the continuous, we can hope that an
optimal solution $y(x)$ to

$$\int \frac{\Gamma(y(x-1)+1)\,\Gamma(y(x)-k+1)}{\Gamma(y(x)+1)\,\Gamma(y(x-1)-k+1)}\,dx$$

yields a good solution to the original sum.

Figure 3. Comparison of various solutions to the subset-guessing problem:
equation (5) (circles), equation (6) (squares), and Kauffman's original RC
algorithm (triangles). The joined plots give the corresponding expected run
times on a logarithmic y-scale.

The calculus of variations is designed for exactly this sort of problem: it can be used to
find functions $y(x)$ that are extrema of the integral

$$\int_a^b f(y(x), y'(x); x)\,dx.$$

In this case we would set $n_i = y(i)$. Unfortunately, it is not clear that the required
computation is tractable. So we use the aforementioned approximation

$$\frac{\Gamma(n_{i-1}+1)\,\Gamma(n_i-k+1)}{\Gamma(n_i+1)\,\Gamma(n_{i-1}-k+1)} \approx \left(\frac{n_{i-1}}{n_i}\right)^k. \tag{7}$$

In order to write the right-hand side of (7) in the form $f(y(x), y'(x); x)$, we use the
approximation $y'(i) \approx y(i) - y(i-1)$. Simple algebra then shows that $y(i-1)/y(i) \approx 1 - y'(i)/y(i)$.
Hence we can set $f(y(x), y'(x); x) = (1 - y'(x)/y(x))^k$. Euler's equation then tells us that
we should look for solutions to the differential equation

$$\frac{\partial f}{\partial y} - \frac{d}{dx}\frac{\partial f}{\partial y'} = 0.$$

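Straightforward computations yield

$$\frac{\partial f}{\partial y} = k\left(1 - \frac{y'}{y}\right)^{k-1}\frac{y'}{y^2}, \qquad \frac{\partial f}{\partial y'} = -\frac{k}{y}\left(1 - \frac{y'}{y}\right)^{k-1},$$

$$\frac{d}{dx}\frac{\partial f}{\partial y'} = k\left(1 - \frac{y'}{y}\right)^{k-1}\frac{y'}{y^2} - \frac{k(k-1)}{y}\left(1 - \frac{y'}{y}\right)^{k-2}\frac{d}{dx}\left(1 - \frac{y'}{y}\right).$$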

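One family of solutions will be those satisfying $y' = y$, i.e., $y(x) = Ce^x$. However, these are
increasing functions so we ignore them. Other solutions are those satisfying, after division
by the common factor $(1 - y'/y)^{k-2}$,

$$k\left(1 - \frac{y'}{y}\right)\frac{y'}{y^2} = k\left(1 - \frac{y'}{y}\right)\frac{y'}{y^2} - \frac{k(k-1)}{y}\,\frac{d}{dx}\left(1 - \frac{y'}{y}\right).$$

This is satisfied by those $y$ for which $\frac{d}{dx}(1 - y'/y) = 0$ or, equivalently, those for which $y'/y$
is a constant. This implies $y(x) = C_1 e^{C_2 x}$.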

The constraints $y(0) = n_0$ and $y(M) = n_M = k$ imply that our approximate solution is

$$y(x) = n_0\, e^{(x/M)\ln(k/n_0)} = n_0\left(\frac{k}{n_0}\right)^{x/M} = n_0^{1-x/M}\, k^{x/M}.$$

By setting $x = i \in \{0, 1, \ldots, M\}$, it follows that we can set $n_i = k^{i/M} n_0^{1-i/M}$, as seen in
equation (6).